Xiaodont Tan
In this section, I examined the structure of the dataset, as well as all the variables in the dataset, including the quality and the attributes of the red wine.
There are 13 variables in the dataset: an index variable X, 11 chemical attributes of the wine, and the quality of the wine.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
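The listing above is the kind of output that `str()` prints for a data frame. As a minimal sketch, the snippet below rebuilds only the first ten rows of five of the columns, copied from the listing. The name `wine_head` is made up for illustration; the full data would be read from a file, whose name is not given here.

```r
# First ten values of five columns, copied from the str() listing above.
# The full dataset has 1599 rows and 13 columns; this is only an excerpt.
wine_head <- data.frame(
  X                = 1:10,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wine_head)              # column types and leading values
summary(wine_head$quality)  # min, quartiles, median, mean and max
```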
The quality scores range from 3 to 8. Most wines have a quality of 5 or 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
I grouped the wine quality into low (3~5) and high (6~8), each category containing about half of the dataset.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
##
## Low High
## 744 855
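The two-way grouping can be reproduced from the counts in the table above. This is a sketch only: `quality.rank` is the column name that appears in the test output later in this report, but the use of `cut()` with these break points is an assumption about how the grouping was done.

```r
# Rebuild the quality column from the counts in the table above
# (10, 53, 681, 638, 199 and 18 wines with quality 3 to 8).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Bin into two groups: (2, 5] becomes "Low", (5, 8] becomes "High".
quality.rank <- cut(quality, breaks = c(2, 5, 8), labels = c("Low", "High"))

table(quality)       # counts per quality score
table(quality.rank)  # counts per group
mean(quality)        # agrees with the mean in the summary above
```

Splitting at 5 gives groups of 744 and 855 wines, which is why each group holds roughly half of the data.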
The fixed acidity ranges from 4.6 to 15.9 and roughly follows a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The citric acid ranges from 0 to 0.8, with a few outliers at around 1. The data is right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Most residual.sugar values lie between 0.9 and 9.0. Again, there are a few outliers with much larger values (from 13 to 16).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Because the distribution has a long tail, the data was log-transformed to give a better view of its shape.
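To illustrate why a log transformation helps with a long tail, here is a small sketch using the first ten residual.sugar values from the structure listing. The base-10 logarithm and the helper `tail_gap()` are assumptions made for this example, not necessarily what was used for the plots.

```r
# First ten residual.sugar values from the str() listing; 6.1 is the large one.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
log_sugar <- log10(residual.sugar)

# Hypothetical helper: distance from the median to the maximum,
# measured in units of the interquartile range.
tail_gap <- function(x) (max(x) - median(x)) / IQR(x)

tail_gap(residual.sugar)  # the largest value sits far out in the tail
tail_gap(log_sugar)       # the gap shrinks after the transformation
```

On the log scale the large values are pulled in towards the bulk of the data, so a histogram shows the shape of the main body instead of being dominated by a few extreme points.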
Most chlorides values lie between 0.012 and 0.3. Again, there are a few outliers with much larger values (from 0.4 to 0.6).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Because the distribution has a long tail, the data was log-transformed to give a better view of its shape.
The volatile acidity is slightly right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The free sulfur dioxide is right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The total sulfur dioxide is right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The density roughly follows a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The pH value roughly follows a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The level of sulphates mostly ranges from 0.33 to 1.5, with some outliers above 1.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Alcohol is right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
There are 1599 red wine observations in the dataset with 13 variables, including an index variable (named “X”), the “quality” variable, and 11 other variables describing the chemical attributes of red wine.
The quality of the wine is an integer, so it takes discrete values.
All the chemical attributes are floating-point numbers. They are measured in different units and therefore span widely different ranges.
Quality of the wine is the main feature of interest. From common sense, I would expect alcohol to also play an important role in the quality of the wine.
All the other features of wine are potentially linked to its quality. From the description of the variables, I would expect volatile acidity and citric acid to influence the quality.
I grouped the quality data into high quality group and low quality group, each containing around half of the dataset.
In this section, I explored the relationships between different variables. In particular, I plotted a few relatively strong relationships between different wine attributes, as well as between attributes and wine quality.
The correlation between different variables in the dataset is shown below.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00 -0.26 0.67
## volatile.acidity -0.26 1.00 -0.55
## citric.acid 0.67 -0.55 1.00
## residual.sugar 0.11 0.00 0.14
## chlorides 0.09 0.06 0.20
## free.sulfur.dioxide -0.15 -0.01 -0.06
## total.sulfur.dioxide -0.11 0.08 0.04
## density 0.67 0.02 0.36
## pH -0.68 0.23 -0.54
## sulphates 0.18 -0.26 0.31
## alcohol -0.06 -0.20 0.11
## quality 0.12 -0.39 0.23
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.11 0.09 -0.15
## volatile.acidity 0.00 0.06 -0.01
## citric.acid 0.14 0.20 -0.06
## residual.sugar 1.00 0.06 0.19
## chlorides 0.06 1.00 0.01
## free.sulfur.dioxide 0.19 0.01 1.00
## total.sulfur.dioxide 0.20 0.05 0.67
## density 0.36 0.20 -0.02
## pH -0.09 -0.27 0.07
## sulphates 0.01 0.37 0.05
## alcohol 0.04 -0.22 -0.07
## quality 0.01 -0.13 -0.05
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.11 0.67 -0.68 0.18 -0.06
## volatile.acidity 0.08 0.02 0.23 -0.26 -0.20
## citric.acid 0.04 0.36 -0.54 0.31 0.11
## residual.sugar 0.20 0.36 -0.09 0.01 0.04
## chlorides 0.05 0.20 -0.27 0.37 -0.22
## free.sulfur.dioxide 0.67 -0.02 0.07 0.05 -0.07
## total.sulfur.dioxide 1.00 0.07 -0.07 0.04 -0.21
## density 0.07 1.00 -0.34 0.15 -0.50
## pH -0.07 -0.34 1.00 -0.20 0.21
## sulphates 0.04 0.15 -0.20 1.00 0.09
## alcohol -0.21 -0.50 0.21 0.09 1.00
## quality -0.19 -0.17 -0.06 0.25 0.48
## quality
## fixed.acidity 0.12
## volatile.acidity -0.39
## citric.acid 0.23
## residual.sugar 0.01
## chlorides -0.13
## free.sulfur.dioxide -0.05
## total.sulfur.dioxide -0.19
## density -0.17
## pH -0.06
## sulphates 0.25
## alcohol 0.48
## quality 1.00
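A matrix of this kind can be produced with `cor()` and `round()`. The sketch below uses only the first ten rows of four columns from the structure listing, so its values will differ from the full-data matrix above. The name `wine_head` is made up for illustration, and the index column X is left out because it carries no chemical information.

```r
# First ten rows of four columns, copied from the str() listing.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pairwise Pearson correlations, rounded to two decimals as in the output above.
cor_head <- round(cor(wine_head), 2)
cor_head
```

The matrix is symmetric with ones on the diagonal, so each pair of variables only needs to be read once.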
The strengths of the correlation relationships are shown in the chart below.
The correlation matrix suggests that fixed.acidity is strongly positively correlated with citric.acid and density (r = 0.67) and strongly negatively correlated with pH (r = -0.68).
The higher the level of fixed acidity is, the higher the level of citric acid is.
The higher the level of fixed acidity is, the higher the density is.
The higher the fixed acidity is, the lower the pH level is.
The relationship between free.sulfur.dioxide and total.sulfur.dioxide is also strong (r = 0.67).
The negative correlation between density and alcohol is also relatively strong (r = -0.5). The higher the density is, the lower the alcohol level is.
The correlation matrix suggests that volatile.acidity, alcohol, citric.acid and sulphates are only weakly to moderately correlated with quality (r = -0.39, 0.48, 0.23 and 0.25 respectively).
The higher the wine quality is, the higher the alcohol level is (there is an exception for wine with quality 5).
The difference in alcohol level between low quality wine and high quality wine is statistically significant.
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
##
## Wilcoxon rank sum test with continuity correction
##
## data: wine$alcohol by wine$quality.rank
## W = 154810, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
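The reported t statistic can be checked by hand from the correlation estimate, since for a Pearson correlation t = r * sqrt(df) / sqrt(1 - r^2). The second half of the sketch runs the same two tests on simulated data, not on the wine data. The data frame `sim` and its coefficients are invented for illustration only.

```r
# Part 1: recompute t from the values printed in the output above.
r  <- 0.4761663   # sample estimate of the correlation
df <- 1597        # 1599 observations minus 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
t_stat            # close to the reported t = 21.639

# Part 2: the same two tests on SIMULATED data (an assumption, not the wine data).
set.seed(1)
sim <- data.frame(quality = sample(3:8, 300, replace = TRUE))
sim$alcohol <- 8 + 0.5 * sim$quality + rnorm(300)
sim$quality.rank <- cut(sim$quality, breaks = c(2, 5, 8),
                        labels = c("Low", "High"))

# Linear association between the raw quality score and alcohol.
pearson <- cor.test(sim$quality, sim$alcohol)
# Location shift in alcohol between the Low and High groups.
wilcoxon <- wilcox.test(alcohol ~ quality.rank, data = sim)

pearson$estimate
wilcoxon$p.value
```

The two tests answer slightly different questions: the Pearson test uses all six quality levels, while the Wilcoxon rank sum test compares only the two groups and makes no normality assumption.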
The higher the quality is, the lower the level of volatile acidity is. The difference between low and high quality wine is statistically significant.
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Wilcoxon rank sum test with continuity correction
##
## data: wine$volatile.acidity by wine$quality.rank
## W = 438910, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The higher the wine quality is, the higher the level of citric acid is. The difference between low and high quality wine is statistically significant.
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
##
## Wilcoxon rank sum test with continuity correction
##
## data: wine$citric.acid by wine$quality.rank
## W = 259850, p-value = 2.555e-10
## alternative hypothesis: true location shift is not equal to 0
The higher the wine quality is, the higher the level of sulphates is. The difference between low and high quality wine is statistically significant.
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
##
## Wilcoxon rank sum test with continuity correction
##
## data: wine$sulphates by wine$quality.rank
## W = 195150, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The more alcohol, citric.acid, sulphates and the less volatile acidity the wine contains, the higher its quality is.
I observed that fixed acidity has a strong correlation with a few other attributes. It is strongly positively correlated with citric.acid and density (r = 0.67) and strongly negatively correlated with pH (r = -0.68).
There is also a strong positive correlation between free.sulfur.dioxide and total.sulfur.dioxide (r = 0.67).
The strongest correlation I found was the one between pH and fixed.acidity (r = -0.68).
The graph shows the relationship between alcohol and fixed acidity for the two quality groups. When the fixed acidity is not very high (4 ~ 10), the alcohol level of high quality wine is higher than that of low quality wine. This is in line with the previous observation that the alcohol level and wine quality are positively correlated. When the fixed acidity is very high (>13), however, the low quality wine has more alcohol than the high quality wine.
When the level of residual sugar is low, it is positively correlated with density, and the density of low quality wine is mostly higher than that of high quality wine. However, when the level of residual sugar is higher (>4), the patterns disappear. (Outliers are removed from the chart.)
When the level of sulphates is low, it is positively correlated with density, and the density of low quality wine is mostly higher than that of high quality wine. However, when the level of sulphates is higher, the patterns disappear. (Outliers are removed from the chart.)
In model 1, all the attributes were used as predictors; the adjusted R-squared was only 0.3561.
##
## Call:
## lm(formula = quality ~ fixed.acidity + citric.acid + residual.sugar +
## chlorides + volatile.acidity + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68911 -0.36652 -0.04699 0.45202 2.02498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.197e+01 2.119e+01 1.036 0.3002
## fixed.acidity 2.499e-02 2.595e-02 0.963 0.3357
## citric.acid -1.826e-01 1.472e-01 -1.240 0.2150
## residual.sugar 1.633e-02 1.500e-02 1.089 0.2765
## chlorides -1.874e+00 4.193e-01 -4.470 8.37e-06 ***
## volatile.acidity -1.084e+00 1.211e-01 -8.948 < 2e-16 ***
## free.sulfur.dioxide 4.361e-03 2.171e-03 2.009 0.0447 *
## total.sulfur.dioxide -3.265e-03 7.287e-04 -4.480 8.00e-06 ***
## density -1.788e+01 2.163e+01 -0.827 0.4086
## pH -4.137e-01 1.916e-01 -2.159 0.0310 *
## sulphates 9.163e-01 1.143e-01 8.014 2.13e-15 ***
## alcohol 2.762e-01 2.648e-02 10.429 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
## F-statistic: 81.35 on 11 and 1587 DF, p-value: < 2.2e-16
In model 2, only the attributes that were significant predictors in model 1 were used as predictors. However, the adjusted R-squared increased only marginally, to 0.3567, so the model still explains only about 36% of the variance in quality.
##
## Call:
## lm(formula = quality ~ chlorides + volatile.acidity + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68918 -0.36757 -0.04653 0.46081 2.02954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4300987 0.4029168 10.995 < 2e-16 ***
## chlorides -2.0178138 0.3975417 -5.076 4.31e-07 ***
## volatile.acidity -1.0127527 0.1008429 -10.043 < 2e-16 ***
## free.sulfur.dioxide 0.0050774 0.0021255 2.389 0.017 *
## total.sulfur.dioxide -0.0034822 0.0006868 -5.070 4.43e-07 ***
## pH -0.4826614 0.1175581 -4.106 4.23e-05 ***
## sulphates 0.8826651 0.1099084 8.031 1.86e-15 ***
## alcohol 0.2893028 0.0167958 17.225 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared: 0.3595, Adjusted R-squared: 0.3567
## F-statistic: 127.6 on 7 and 1591 DF, p-value: < 2.2e-16
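The comparison between a full and a reduced model can be sketched as follows. The data here is simulated, not the wine data: the column `noise`, the coefficients and the continuous `quality` response are all assumptions chosen to show that dropping a non-significant predictor barely changes the adjusted R-squared.

```r
# SIMULATED data for illustration; means and spreads loosely mimic the wine data.
set.seed(42)
n <- 500
sim <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  noise            = rnorm(n)  # unrelated to quality by construction
)
sim$quality <- 3 + 0.3 * sim$alcohol - 1 * sim$volatile.acidity +
  rnorm(n, sd = 0.65)

# Full model with every predictor, then a reduced model without `noise`.
model_full    <- lm(quality ~ alcohol + volatile.acidity + noise, data = sim)
model_reduced <- lm(quality ~ alcohol + volatile.acidity, data = sim)

adj_full    <- summary(model_full)$adj.r.squared
adj_reduced <- summary(model_reduced)$adj.r.squared
c(full = adj_full, reduced = adj_reduced)

# F test for whether the dropped term improved the fit.
anova(model_reduced, model_full)
```

Removing predictors that carry no information cannot add explanatory power, so a small change in adjusted R-squared between model 1 and model 2 is the expected outcome.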
When the fixed acidity is not very high (4 ~ 10), the alcohol level of high quality wine is higher than that of low quality wine. This is in line with the previous observation that the alcohol level and wine quality are positively correlated. When the fixed acidity is very high (>13), however, the low quality wine has more alcohol than the high quality wine.
For residual.sugar and sulphates, when the level is low, it is positively correlated with density, and the density of low quality wine is mostly higher than that of high quality wine. However, when the levels are higher, the patterns disappear.
I created two multiple regression models of wine quality. In model 1, all the attributes were used as predictors; the adjusted R-squared was only 0.3561. In model 2, only the attributes that were significant predictors in model 1 were used as predictors. However, the adjusted R-squared increased only marginally, to 0.3567, so the model still explains only about 36% of the variance in quality.
The two models took all the possible attributes into consideration. However, some attributes are correlated to each other, which might influence the goodness of the model. The influence of some attributes might also not be linear.
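One common way to quantify how much predictors overlap is the variance inflation factor (VIF), which was not computed in this project. The sketch below calculates it by hand on simulated data. The variable names mirror the wine attributes, but the values and the strength of the relationship between them are assumptions.

```r
# SIMULATED predictors; pH is built to fall as fixed.acidity rises.
set.seed(7)
n <- 500
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
pH            <- 3.9 - 0.07 * fixed.acidity + rnorm(n, sd = 0.12)
alcohol       <- rnorm(n, mean = 10.4, sd = 1)
sim <- data.frame(fixed.acidity, pH, alcohol)

# VIF: regress each predictor on all the others, then take 1 / (1 - R^2).
vif <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  fit <- lm(reformulate(others, response = v), data = sim)
  1 / (1 - summary(fit)$r.squared)
})
round(vif, 2)
```

A VIF near 1 means a predictor is nearly independent of the others. Larger values mean its coefficient is estimated less precisely because part of its information is shared with other predictors.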
One major finding of the project is that the alcohol level is an important indicator of the wine’s quality. Wines with higher quality (quality 6, 7, 8) contain more alcohol than wines with lower quality (quality 3, 4, 5).
The relationship between alcohol level and wine quality interacts with other attributes of the wine. For example, when the level of fixed acidity is below 10, high quality wine has a higher level of alcohol. When the level of fixed acidity is above 10, however, the pattern disappears.
The plot is also an example showing that the relationship between different attributes might only hold in a certain range. For example, the level of fixed acidity is negatively related to the level of alcohol when the level of fixed acidity is below 8. When the level of fixed acidity is above 8, however, the pattern disappears.
The plot below shows another example of interaction between different variables.
The low quality wine has lower level of sulphates, as there are more red dots at the left of the plot.
When the level of sulphates is low (below 1), low quality wine has higher density, as the red line is above the green line. At the same time, the sulphates level is roughly positively related to density.
When the level of sulphates is high, no apparent patterns were identified.
The correlation coefficients show that volatile.acidity, alcohol, citric.acid and sulphates have the strongest correlations with wine quality. Higher quality wine has higher levels of alcohol, citric acid and sulphates and a lower level of volatile acidity.
In the multiple regression model, however, the significant predictors of wine quality are alcohol, volatile.acidity, sulphates, chlorides, pH, free.sulfur.dioxide and total.sulfur.dioxide.
Plotting shows that, in fact, the relationships between some factors are not linear, and some correlations hold only in a certain range. Some relationships also interact with other factors.
In this dataset, the majority of the wines are of quality 5 and 6, which could actually be categorized as “medium” quality if 3 and 4 were categorized as low quality and 7 and 8 as high quality. However, that would make the size difference between the groups too big. As a result, I grouped the wine quality into low (3, 4, 5) and high (6, 7, 8) instead. The two-group analysis might not be able to capture the features of medium quality wine.
In this project, although log transformations were applied to some variables in the univariate analysis, the bivariate and multivariate analyses were conducted only on the original data, not the transformed data. Further models could be built on the transformed data.
Because I was unfamiliar with ggplot, I struggled a lot with choosing the right geom and parameters. Some of the plots could be fine-tuned to look nicer or be more explicit.
In addition, some background research on red wine might give more insight into which variables to focus on and how to interpret the findings.